HOTFIX: 127.0.0.1 not localhost in verifier healthcheck (B02 Phase 2 follow-up) #36
Merged
The deploy after PR #35 succeeded in building both containers, but the verifier never became 'healthy' from Docker's perspective:

Connecting to localhost:3001 ([::1]:3001)
wget: can't connect to remote host: Connection refused

Root cause: alpine ships busybox wget. Busybox wget resolves `localhost` to ::1 (IPv6) first and does NOT fall back to 127.0.0.1 (IPv4) on refusal. The verifier binds 0.0.0.0 (IPv4-only). Every healthcheck got Connection refused, the container was marked unhealthy after 3 retries, and zeroauth-prod (which depends on it via depends_on: service_healthy) never started.

Result: prod was 502 for ~3 minutes between 07:21 UTC and 07:25 UTC, until I manually started zeroauth-prod with --no-deps via SSH. That restored service. The verifier was running and responding to requests fine the whole time; only the healthcheck command was wrong.

Fix: use the literal 127.0.0.1 in both the Dockerfile HEALTHCHECK and the compose-level healthcheck. The two are redundant by design: the compose-level check wins for `docker compose` orchestration, and the Dockerfile HEALTHCHECK wins for `docker run` outside compose. Both need to be correct. A comment in both places explains why localhost is wrong, so the next operator doesn't revert it.

Production state right now: zeroauth-prod is up and healthy via the manual --no-deps recovery. The verifier is up and responding but marked unhealthy by Docker (cosmetic: it doesn't block anything, since prod is now running without the dependency wait). After this hotfix deploys, both will be healthy and the dependency edge reactivates on the next restart.

Verified locally: docker exec zeroauth-verifier wget -qO- http://127.0.0.1:3001/health → {"status":"ok","version":"0.1.0","vkeyAvailable":true,...}

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pulkitpareek18 added a commit that referenced this pull request on May 15, 2026

Task 4 of today. Formally records the decision Pulkit made yesterday when he picked Plan B over Plan A. Captures the three reasons single-engineer velocity beat the brainstorm's Rust spec, what we gave up (reproducible-build provenance, smaller transitive surface, unsafe-discipline) and what we kept (cross-repo HTTP shape stays Rust-compatible if we ever swap).

Also pins the inline-fallback retirement plan:

- 2026-05-15: verifier shipped, inline path unused but compiled-in
- 2026-05-16 → 2026-06-06: 3-week soak in prod
- 2026-06-08: PR to delete verifyInline + snarkjs from root deps + refuse-to-start when VERIFIER_URL is unset
- 2026-06-09: prod runs verifier-only

References the three shipping PRs (#35 cutover, #36 healthcheck hotfix, #37 SQLite audit log) + the plan-mode design doc + the B02 build prompt that we rejected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
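The 2026-06-08 step includes refusing to start when `VERIFIER_URL` is unset. If the service happens to be launched through a shell entrypoint, that guard could be as small as the sketch below. This is an assumption-laden illustration: the function name `require_verifier_url` and the error message are hypothetical, and the real change may well live in the application code instead.

```bash
#!/bin/sh
# Hypothetical entrypoint guard for the planned "refuse-to-start when
# VERIFIER_URL is unset" behaviour. Names and messages are illustrative,
# not taken from the actual codebase.
require_verifier_url() {
  if [ -z "${VERIFIER_URL:-}" ]; then
    echo "refusing to start: VERIFIER_URL is unset and the inline verifier has been removed" >&2
    return 1
  fi
  echo "using verifier at $VERIFIER_URL"
}

# In an entrypoint this would typically be followed by exec'ing the server:
#   require_verifier_url || exit 1
#   exec "$@"
```

Failing fast at startup turns a silent fallback into a visible deploy error, which is the point of retiring the inline path.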
pulkitpareek18 added a commit that referenced this pull request on May 28, 2026

First issue of the BFSI v1 compliance roadmap, owned by Agent #36 (Chief Compliance Officer). Covers the four certification tracks that gate the 12-month plan: DPDP Act 2023, the four binding RBI Master Directions (IT Governance, Digital Lending, Digital Payment Security Controls, KYC), SOC 2 Type I + Type II, and ISO/IEC 27001:2022. The RBI Sandbox application is tracked alongside as a Q3 deliverable.

Eight sections per the agent-36 W1-Mon ticket:

1. Scope (in/out + India primary, GCC/UK secondary v2 lookahead).
2. Frameworks tracked with auditor + counsel relationships.
3. Q1-Q4 milestones aligned to the phase map in docs/plan/bfsi-v1/00-README.md.
4. Per-quarter deliverables table (D-Qn-NN IDs, owner agent, target week, dependencies) covering the year end-to-end.
5. Audit calendar weeks 1-52 listing every external interaction.
6. Vendor + counsel calendar (DPDP counsel, external cryptographer, SOC 2 auditor, ISO lead auditor, smart-contract audit firm, RBI counsel, bug bounty platform, evidence collector tool).
7. Open dependencies + risks (R-COMP-01..08) with owner + mitigation for each. Explicitly captures the three risks called out in the ticket: DPDP rule notification mid-evidence, evidence-collector tool slip, trusted-setup ceremony slip blocking ISO certification.
8. Document hygiene rules: quarterly retros in docs/compliance/retros/, regulator interaction log in docs/compliance/regulator-log.md, evidence pack rotation each quarter.

Cross-references docs/plan/bfsi-v1/06-ways-of-working.md for the escalation path and docs/threat_model.md for the attack catalogue that control narratives map to. Calls out the trusted-setup ceremony artefact at docs/cryptography/trusted-setup-ceremony.md as the input to ISO Annex A.5.31 and SOC 2 CC6.1 evidence.

[no-test] markdown-only deliverable per ticket. Reviewer: Agent #1.
pulkitpareek18 pushed a commit that referenced this pull request on May 28, 2026

First issue of the enterprise risk register at docs/compliance/risk/enterprise-risk-register-v1.md. Captures the 10 baseline commercial, operational, regulatory, strategic, security, and financial risks that the founder, CCO, CRO, and Risk & Audit lead carry on their dashboards.

Distinct from docs/threat_model.md, which holds the technical attack catalogue (A-NN rows). Each enterprise risk references the threat-model rows it relates to, so the two documents stay bidirectionally linked per the §6.5 operating principle.

Document deliverable A40-W1-Mon from docs/plan/bfsi-v1/agents/agent-40-risk-audit.md. Pairs with the compliance roadmap at docs/compliance/compliance-roadmap-v1.md, whose §7 holds the thinner compliance-bearing subset; this register is the authoritative copy.

References docs/threat_model.md throughout (A-02, A-07, A-09, A-10, A-13, A-17, A-21, A-22, A-28), docs/cryptography/trusted-setup-ceremony.md (R-ENT-04, R-ENT-07) and docs/compliance/privacy/data-inventory-v1.md (R-ENT-03 scoping).

Risks are classified by likelihood (1..5) x impact (1..5) with appetite bands accept <= 6, review 7-12, reject >= 13. At v1 all residuals sit in the auto-accept band after mitigation.

Cadence is a weekly walk by Agent #40, a monthly review with Agent #1 + #36 + #42 on the 15th, a quarterly board review in the last week of each Q, plus event-driven triggers per §6.3. Sign-offs in §7.

[no-test] markdown-only documentation deliverable. Next review 2026-06-01 per the A40-W2-Mon ticket, which updates the register with commit hashes for closed mitigations.
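The register's scoring rule is plain arithmetic: score = likelihood x impact, each on a 1..5 scale, mapped onto the three appetite bands. A minimal sketch of that mapping, with `risk_band` as a hypothetical helper name (the register itself is a markdown document, not code):

```bash
#!/bin/sh
# Sketch of the scoring rule: score = likelihood x impact (both 1..5),
# mapped to appetite bands accept (<= 6), review (7-12), reject (>= 13).
risk_band() {
  likelihood=$1
  impact=$2
  # Reject anything that is not a single digit in 1..5.
  for v in "$likelihood" "$impact"; do
    case $v in
      [1-5]) ;;
      *) echo "likelihood and impact must be integers 1..5" >&2; return 2 ;;
    esac
  done
  score=$((likelihood * impact))
  if [ "$score" -le 6 ]; then
    band=accept
  elif [ "$score" -le 12 ]; then
    band=review
  else
    band=reject
  fi
  echo "$score $band"
}
```

For example, `risk_band 3 3` prints `9 review` and `risk_band 5 3` prints `15 reject`. Since every product of two 1..5 integers falls in exactly one band, only the 6/7 and 12/13 boundaries need care.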
Hotfix for PR #35. Production was 502 for ~3 minutes between 07:21 and 07:25 UTC because the new verifier container failed its healthcheck → `zeroauth-prod` (which depends on it via `condition: service_healthy`) never started.
Root cause
Alpine ships busybox `wget`. Busybox `wget` resolves `localhost` to `::1` (IPv6) first and does NOT fall back to `127.0.0.1` (IPv4) on refusal. The verifier binds `0.0.0.0`, which is IPv4-only.
```text
Connecting to localhost:3001 ([::1]:3001)
wget: can't connect to remote host: Connection refused
```
The verifier was running and serving HTTP perfectly the whole time. Only the healthcheck command was wrong.
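To confirm this diagnosis on a running host, one can probe the health endpoint over each loopback family explicitly from inside the container. A minimal sketch, assuming the container name `zeroauth-verifier` and port `3001` from this PR, and a busybox build whose `wget` supports `-T` (timeout in seconds) and bracketed IPv6 literals; `probe_loopbacks` is a hypothetical helper name.

```bash
#!/bin/sh
# Probe /health over IPv4 and IPv6 loopback separately, from inside the
# container, to see which address family the verifier actually answers on.
probe_loopbacks() {
  container=${1:-zeroauth-verifier}
  port=${2:-3001}
  for host in 127.0.0.1 '[::1]'; do
    if docker exec "$container" wget -q -T 2 -O /dev/null "http://$host:$port/health" >/dev/null 2>&1; then
      echo "$host: ok"
    else
      echo "$host: FAILED"
    fi
  done
}
```

Given the root cause described here (an IPv4-only bind), running `probe_loopbacks zeroauth-verifier 3001` should report `127.0.0.1: ok` and `[::1]: FAILED`, the same split the healthcheck log shows.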
Manual recovery already done
I SSH'd to the VPS at 07:25 UTC and ran:
```bash
cd /opt/zeroauth && docker compose --profile prod up -d --no-deps zeroauth-prod
```
That started `zeroauth-prod` without waiting for the `service_healthy` dependency. Production has been serving traffic normally since. The API actually IS hitting the verifier (its `VERIFIER_URL=http://zeroauth-verifier:3001` env was preserved); the verifier service itself works fine, and only Docker's healthcheck status is wrong.
So this PR is a correctness restoration, not an emergency: without this fix, the next `docker compose up -d --build --remove-orphans` (e.g. the next deploy) would reintroduce the same hang on the dependency wait.
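Because the app's `/health` response and Docker's health verdict diverged here, it can help to read Docker's own view directly before reaching for `--no-deps`, or to confirm recovery after the hotfix deploys. A minimal polling sketch: `wait_healthy` and its attempt/delay defaults are hypothetical and unrelated to the compose file's own `retries` setting; the `docker inspect --format '{{.State.Health.Status}}'` query is standard Docker.

```bash
#!/bin/sh
# Poll Docker's health status for a container until it reports "healthy",
# giving up after a fixed number of attempts.
wait_healthy() {
  container=$1
  attempts=${2:-10}
  delay=${3:-3}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Empty if the container has no healthcheck or does not exist.
    status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)
    if [ "$status" = "healthy" ]; then
      echo "$container: healthy"
      return 0
    fi
    echo "$container: ${status:-unknown} (attempt $i/$attempts)" >&2
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_healthy zeroauth-verifier 20 3` returns non-zero if the verifier is still `starting` or `unhealthy` after about a minute, which is the signal that a forced `--no-deps` start is masking a broken healthcheck rather than a broken service.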
What changed
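Two-line fix in two places:
- the Dockerfile `HEALTHCHECK`
- the compose-level `healthcheck`

Both now use the literal `127.0.0.1` instead of `localhost`. They are redundant by design: the compose-level check is what Docker uses under `docker compose`, while the Dockerfile `HEALTHCHECK` covers `docker run` outside compose, so both need to be correct. Both carry a comment explaining why `localhost` is wrong, so the next operator doesn't revert.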
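The change itself is just the host literal inside the existing `wget` healthcheck command. If the check ever grows beyond a one-liner, a small helper keeps the literal and its rationale in one place. Everything in this sketch is hypothetical and not part of the PR: the `HEALTH_HOST`/`HEALTH_PORT` variables, the function names, and the extra check that the body contains `"status":"ok"` (the response shape shown under Verified).

```bash
#!/bin/sh
# Hypothetical healthcheck helper for the verifier image. It pins the IPv4
# loopback literal because busybox wget resolves "localhost" to ::1 first and
# does not fall back, while the verifier only listens on 0.0.0.0 (IPv4).
HEALTH_HOST=${HEALTH_HOST:-127.0.0.1}   # deliberately NOT "localhost"
HEALTH_PORT=${HEALTH_PORT:-3001}

health_url() {
  printf 'http://%s:%s/health\n' "$HEALTH_HOST" "$HEALTH_PORT"
}

# Returns 0 only if the endpoint answers and reports "status":"ok".
check_health() {
  body=$(wget -qO- "$(health_url)" 2>/dev/null) || return 1
  case $body in
    *'"status":"ok"'*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A `HEALTHCHECK` could then run a script that defines these functions and ends with `check_health`; Docker treats exit status 0 as healthy and 1 as unhealthy.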
Verified
```bash
ssh root@104.207.143.14 \
'docker exec zeroauth-verifier wget -qO- http://127.0.0.1:3001/health'
{"status":"ok","version":"0.1.0","vkeyAvailable":true,"uptimeSeconds":202}
```
Test plan
🤖 Generated with Claude Code